Add script to tune parameters #179

Merged
thomasfaingnaert merged 23 commits into master from tf/parameter-tuning on Jan 3, 2024

Conversation

thomasfaingnaert (Member) commented Nov 23, 2023

Add a simple script to perform an exhaustive search on the parameter space, based on #154.
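
Roughly, the script boils down to a brute-force sweep over the tiling parameters. Below is a minimal sketch of the idea only; the grid and the time_config helper are illustrative placeholders, not the script's actual API. Judging from the make_plot signature further down in this thread, the entries of the result tuples below should read as [BLOCK_M, BLOCK_N, BLOCK_K, WARPS_M, WARPS_N, kernel].

using GemmKernels

# Placeholder helper: build a GemmKernels.Config from the parameters, run and
# verify the kernel, and return its runtime in seconds, or `nothing` when the
# configuration is invalid, produces a wrong result, or throws.
time_config(block_m, block_n, block_k, warps_m, warps_n, kernel) = nothing

function tune_operation()
    best_params, best_time = nothing, Inf
    for block_m in (64, 128, 256), block_n in (64, 128, 256), block_k in (16, 32, 64, 128),
        warps_m in (1, 2, 4, 8), warps_n in (1, 2, 4, 8),
        kernel in (GemmKernels.Kernel.matmul_singlestage,)

        t = time_config(block_m, block_n, block_k, warps_m, warps_n, kernel)
        t === nothing && continue
        if t < best_time
            best_params = Any[block_m, block_n, block_k, warps_m, warps_n, kernel]
            best_time = t
        end
    end
    return best_params, best_time
end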

For the RTX 4070, these are the results:

Optimal parameters:
NN: Any[128, 64, 32, 1, 4, GemmKernels.Kernel.matmul_singlestage]
NT: Any[128, 64, 32, 1, 4, GemmKernels.Kernel.matmul_singlestage]
TN: Any[128, 64, 32, 8, 1, GemmKernels.Kernel.matmul_singlestage]
TT: Any[128, 64, 32, 2, 4, GemmKernels.Kernel.matmul_singlestage]

plot.pdf

I've been gradually generalising the kernel by fixing any miscompilations / incorrect results that come up, but we still only explore a relatively small part of the design space: only about 3.2 / (100 - 85) ≈ 21.3% of the configurations that ought to work actually run successfully. At some point I'll have to look into the incorrect results and error-throwing configurations:

====================================================================================================
Overall configurations:
====================================================================================================
Total:                                     4000 configurations
----------------------------------------------------------------------------------------------------
Skipped due to invalid GemmKernels config: 3402 (85.0%)
Produced incorrect result:                 112 (2.8%)
Threw an error:                            359 (9.0%)
Successful runs:                           127 (3.2%)
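
For concreteness, each configuration ends up in one of those four buckets; a hedged sketch of that classification (build_config, run_config, and d_reference are placeholders for the script's real helpers, not its actual API):

# Hedged sketch: classify one parameter set into the four buckets above.
function classify(params, d_reference; build_config, run_config)
    cfg = try
        build_config(params)   # invalid parameter combinations throw here -> "skipped"
    catch
        return :skipped
    end
    d = try
        run_config(cfg)        # compile, launch and synchronize the kernel
    catch
        return :error
    end
    return isapprox(d, d_reference; rtol = sqrt(eps(Float16))) ? :success : :incorrect
end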

@maleadt Could you have a look at this and run it on more recent hardware as well?

Some notes:

  • Using Octavian to generate the ground truth on the CPU is pretty much a requirement if you don't want to wait around forever. It would probably be better to generate the ground truth on the GPU instead (see the sketch after this list).
  • I've noticed that some configurations result in an illegal memory access. It seems to be a codegen issue, as adding @cuprintf resolves it. For now I've added an assert to mask the issue so we can test those configurations (under the assumption that it will not influence the run time too much...).
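
A minimal sketch of generating the ground truth on the GPU instead, assuming the usual Float16 × Float16 → Float32 case (the sizes and tolerance here are arbitrary, not the script's actual values):

using CUDA

a = CUDA.rand(Float16, 4096, 4096)
b = CUDA.rand(Float16, 4096, 4096)

# Reference result from a plain cuBLAS SGEMM on Float32 copies; no CPU round-trip.
d_reference = Float32.(a) * Float32.(b)

# Later, compare the GemmKernels result `d` against it, e.g.:
# isapprox(d, d_reference; rtol = sqrt(eps(Float16)))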

maleadt (Member) commented Nov 23, 2023

RTX6000 Ada:

Optimal parameters:
NN: Any[64, 256, 128, 2, 8, GemmKernels.Kernel.matmul_singlestage]
NT: Any[128, 128, 128, 8, 1, GemmKernels.Kernel.matmul_singlestage]
TN: Any[128, 128, 128, 4, 4, GemmKernels.Kernel.matmul_singlestage]
TT: Any[128, 128, 128, 4, 2, GemmKernels.Kernel.matmul_singlestage]

plot.pdf

H100:

NN: Any[256, 128, 128, 2, 8, GemmKernels.Kernel.matmul_singlestage]
NT: Any[128, 128, 64, 4, 2, GemmKernels.Kernel.matmul_singlestage]
TN: Any[128, 256, 128, 4, 4, GemmKernels.Kernel.matmul_singlestage]
TT: Any[256, 64, 128, 2, 4, GemmKernels.Kernel.matmul_singlestage]

No plot, as it failed during make_plot:

N = 128
N = 256
ERROR: LoadError: ConfigError: Requested too many threads for this kernel: This kernel can be launched using at most 384 threads, while this configuration required 512
Stacktrace:
 [1] matmul(conf::GemmKernels.Config{(M = 256, N = 256, K = 256), (M = 256, N = 128, K = 128), 16, (M = 256, K = 1), (M = 8, K = 1), (K = 128, N = 2), (K = 8, N = 1), (M = 128, N = 1), (M = 4, N = 1), (M = 128, N = 16, K = 16), (M = 16, N = 16, K = 16), GemmKernels.Layout.UnsafeAlignedColMajor{Float16}, GemmKernels.Layout.UnsafeAlignedColMajor{Float16}, GemmKernels.Layout.Zero{Float32}, GemmKernels.Layout.UnsafeAlignedColMajor{Float32}, GemmKernels.Layout.Padded{GemmKernels.Layout.UnsafeAlignedColMajor{Float16}, 8}, GemmKernels.Layout.Padded{GemmKernels.Layout.UnsafeAlignedColMajor{Float16}, 8}, GemmKernels.Layout.UnsafeAlignedColMajor{Float32}, GemmKernels.Layout.UnsafeAlignedColMajor{Float32}, GemmKernels.Operator.WMMAOp{16, 16, 16, Float16, Float32}, true, true}, a::CuArray{Float16, 2, CUDA.Mem.DeviceBuffer}, b::CuArray{Float16, 2, CUDA.Mem.DeviceBuffer}, c::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, d::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}; transform_global_to_shared_a::GemmKernels.Transform.Elementwise{typeof(identity)}, transform_global_to_shared_b::GemmKernels.Transform.Elementwise{typeof(identity)}, transform_global_to_shared_c::GemmKernels.Transform.Elementwise{typeof(identity)}, transform_shared_to_global_d::GemmKernels.Transform.Elementwise{typeof(identity)}, transform_shared_to_regs_a::GemmKernels.Transform.Elementwise{var"#3#5"{Float16}}, transform_shared_to_regs_b::GemmKernels.Transform.Elementwise{typeof(identity)}, transform_shared_to_regs_c::GemmKernels.Transform.Elementwise{var"#4#6"{Float16}}, transform_regs_to_shared_d::GemmKernels.Transform.Elementwise{typeof(identity)}, epilogue::GemmKernels.Epilogue.Default, kernel::typeof(GemmKernels.Kernel.matmul_singlestage))
   @ GemmKernels ~/Julia/pkg/GemmKernels/src/matmul.jl:40
 [2] run_gemm(cf::Configuration, a::CuArray{Float16, 2, CUDA.Mem.DeviceBuffer}, b::CuArray{Float16, 2, CUDA.Mem.DeviceBuffer}, c::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, d::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})
   @ Main ~/Julia/pkg/GemmKernels/configs/configs.jl:78
 [3] macro expansion
   @ ~/Julia/pkg/CUDA/src/profile.jl:308 [inlined]
 [4] make_plot(BLOCK_M::Int64, BLOCK_N::Int64, BLOCK_K::Int64, WARPS_M::Int64, WARPS_N::Int64, kernel::Function, transpose_a::Bool, transpose_b::Bool)
   @ Main ~/Julia/pkg/GemmKernels/tuning/tune-wmma.jl:151
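
For reference, a hedged reading of that error off the Config printed in the stacktrace: the block tile is (M = 256, N = 128) and the per-warp compute tile is (M = 128, N = 16), so the block needs (256 ÷ 128) × (128 ÷ 16) = 16 warps (matching the Config's warps-per-block value of 16), i.e. 512 threads, while the compiled kernel reports a 384-thread launch limit, presumably from register pressure:

block = (M = 256, N = 128)   # block tile, read from the Config above
warp  = (M = 128, N = 16)    # per-warp compute tile, read from the Config above

warps   = (block.M ÷ warp.M) * (block.N ÷ warp.N)   # 16 warps per block
threads = 32 * warps                                # 512 threads requested at launch
threads <= 384 || @warn "exceeds the kernel's reported 384-thread launch limit" threads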

thomasfaingnaert force-pushed the tf/parameter-tuning branch 28 times, most recently from 30e6e26 to a50cfe6 on December 2, 2023 12:08
Comment on lines +68 to +73
# Sudo keep-alive: periodically refresh the cached sudo credentials in the
# background, and stop once the parent script ($$) has exited.
while true; do
    sleep 300
    sudo -n true
    kill -0 "$$" || exit
done &> /dev/null &
Member

Ha, cute! It might be more reliable, though (e.g. when sudo is configured to always prompt again), to fork off a privileged process that waits for the parent instead.

thomasfaingnaert added 4 commits that referenced this pull request on Dec 7, 2023
thomasfaingnaert marked this pull request as ready for review on January 2, 2024 13:25
thomasfaingnaert merged commit f5bd0e3 into master on Jan 3, 2024
1 check failed
thomasfaingnaert deleted the tf/parameter-tuning branch on January 3, 2024 10:18